Galaxy for virologist training Exercise 2: Quality control and trimming

No sequencing method is error-free, despite continued improvements. Measuring sequencing quality correctly is essential for identifying problems in a run, so it must be the first step of every sequencing analysis. Once the quality control is finished, low-quality and short reads have to be removed, which is what the trimming step does. After trimming, it is recommended to run a second quality control to confirm that the trimming worked.

1. Illumina Quality control and trimming

Title: Pre-processing
Training dataset: PRJEB43037 - In August 2020, an outbreak of West Nile Virus affected 71 people with meningoencephalitis in Andalusia and 6 more in Extremadura (south-west of Spain), causing a total of eight deaths. The virus belonged to lineage 1 and was relatively similar to viruses from previous outbreaks in the Mediterranean region. Here, we present a detailed analysis of the outbreak, including an extensive phylogenetic study. This is one of the outbreak samples.
Questions:
  • How do I check whether my Illumina data was correctly sequenced?
  • How can I improve the quality of my data?
Objectives:
  • Perform quality control on raw Illumina reads
  • Perform quality trimming on raw Illumina reads
  • Perform quality control on trimmed Illumina reads
Estimated time: 25 min

1. Quality control

To run the quality control over the samples, follow these steps:

  1. Create a new history named Illumina preprocessing, as explained yesterday.
  2. Upload the data as seen yesterday by copying and pasting the following URLs:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR531/002/ERR5310322/ERR5310322_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR531/002/ERR5310322/ERR5310322_2.fastq.gz

  3. Search for the fastqc tool, select FastQC Read Quality reports and set the following parameters:

fastqc_run

To see the results, open the outputs with "Web page" in their name for both data 1 and data 2.

fastqc_results_r1r2

Here, you can see the number of reads in each file, the maximum and minimum length of all reads in the sample, and the quality plots for both R1 and R2. They look quite good, but we are still going to trim the samples.
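The quality plots are built from the Phred scores stored in the fourth line of every FASTQ record. As a minimal sketch of the idea (not FastQC's own code), the following decodes quality strings and averages the score at each read position. It assumes the standard Phred+33 encoding, and the toy quality strings are made up for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality_per_position(quality_lines):
    """Average Phred score at each read position, allowing reads of different lengths."""
    totals, counts = [], []
    for line in quality_lines:
        for i, q in enumerate(phred_scores(line)):
            if i == len(totals):
                # First read that reaches this position: open a new column.
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy quality strings, invented for this example (not taken from ERR5310322).
toy_qualities = ["IIII", "II5", "I5"]
print(phred_scores("I5"))  # 'I' decodes to 40 and '5' decodes to 20
print(mean_quality_per_position(toy_qualities))
```

FastQC shows a box plot per position rather than only the mean, so this is a simplification of what the report displays.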

How many reads do the samples have?
265989
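If you have the FASTQ files on your own machine, the read count can be cross-checked outside Galaxy. The sketch below assumes the usual layout of exactly four lines per record (it would miscount a multi-line FASTQ). The function name and the toy file are hypothetical and only serve the demonstration.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly four lines per record."""
    # Use gzip for .gz files and the plain opener otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of four")
    return n_lines // 4


# Demo on a tiny made-up file with two records.
toy = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.fastq.gz")
    with gzip.open(toy_path, "wt") as handle:
        handle.write(toy)
    print(count_fastq_reads(toy_path))  # prints 2
```

Calling `count_fastq_reads` on a downloaded copy of `ERR5310322_1.fastq.gz` should give the same figure that FastQC reports as the total number of sequences for that file.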

First question

How do I check whether my Illumina data was correctly sequenced?
Using FastQC

2. Trimming

Once we have performed the quality control, we have to perform the quality and read length trimming:

  1. Search for fastp in the tools and select fastp - fast all-in-one preprocessing for FASTQ files
  2. Select custom parameters:
  3. Finally, click on Execute
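To make the two ideas behind this step concrete (quality trimming and read length filtering), here is a deliberately simplified sketch. It is not fastp's algorithm: it only removes low-quality bases from the 3' end and then discards reads that end up too short. The thresholds `min_q` and `min_len` are hypothetical values chosen for the example, not the parameters used in this exercise.

```python
def trim_tail(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_and_filter(records, min_q=20, min_len=50):
    """Yield trimmed (name, seq, qual) records that are still at least min_len long."""
    for name, seq, qual in records:
        seq, qual = trim_tail(seq, qual, min_q)
        if len(seq) >= min_len:
            yield name, seq, qual


# Toy reads, invented for illustration: '#' decodes to Phred 2 and 'I' to Phred 40.
toy_reads = [
    ("r1", "ACGTACGT", "IIIIII##"),  # loses its last two bases
    ("r2", "ACGT", "####"),          # trimmed to nothing, so it is discarded
]
print(list(trim_and_filter(toy_reads, min_q=20, min_len=5)))
```

With these toy reads only `r1` survives, shortened to six bases. This is the same effect you see in the fastp report as a drop in both read count and total bases.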

fastp_1

fastp_2

fastp_3

fastp_4

To see the trimming stats, have a look at the fastp on data 2 and data 1: HTML report file. You should see something like this.

fastp_results

How many reads have we lost?
98664 reads

Other trimming tools

  1. Search for trimmomatic in the tools and select Trimmomatic flexible read trimming tool for Illumina NGS data
  2. Select custom parameters:
  3. Select Execute

trimmomatic_1

trimmomatic_2

Trimmomatic does not report statistics on the trimmed reads, so we need to run FastQC again on the Trimmomatic results.

Try to do it on your own.

fastqc_trimmomatic

fasqc_trimming_res

Second question

How can I improve the quality of my data?
Using trimming software, such as fastp or Trimmomatic.

2. Nanopore Quality control and trimming

Title: Galaxy
Training dataset: The data we are going to work with is Nanopore amplicon sequencing data generated with ARTIC network primers for the SARS-CoV-2 genome. From the Fast5 files generated by the ONT software, we are going to select the pass reads, so they are already filtered by quality.
Questions:
  • How do I know if my Nanopore data was correctly sequenced?
Objectives:
  • Perform quality control on raw Nanopore reads
  • Perform quality trimming on raw Nanopore reads
  • Perform quality control on trimmed Nanopore reads
Estimated time: 15 min

1. Quality control

To run the quality control over the samples, follow these steps:

  1. Create a new history named Nanopore quality, as explained yesterday.
  2. Upload the data as seen yesterday by copying and pasting the following URLs:

https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_0.fastq
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_1.fastq
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/nanopore/minion/fastq_pass/barcode01/FAO93606_pass_barcode01_7650855b_2.fastq

  3. Search for the Nanoplot tool and select NanoPlot Plotting suite for Oxford Nanopore sequencing data and alignments
  4. Run the tool as follows:

preproc_nanopore_nanoplot_run

Now we are going to have a look at the results.

  1. Select the eye icon in the NanoPlot on data 3, data 2, and data 1: HTML report result.
  2. Have a look at the stats.

nanoplot_results

As you can see, the Mean read length is around 500 nt, which makes sense because we are using amplicon sequencing data.
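Summary figures of this kind can be reproduced from the read lengths alone. The sketch below is not NanoPlot's code: it assumes four-line FASTQ records and uses made-up lengths to compute the mean read length and the N50 (the read length at which the longest reads together cover at least half of all sequenced bases).

```python
def read_lengths(fastq_lines):
    """Return the sequence length of every record, assuming four lines per record."""
    return [len(line.strip()) for i, line in enumerate(fastq_lines) if i % 4 == 1]


def mean_length(lengths):
    """Arithmetic mean of the read lengths."""
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


# Toy lengths, invented for illustration (not taken from the barcode01 files).
toy_lengths = [400, 500, 600]
print(mean_length(toy_lengths))  # 500.0
print(n50(toy_lengths))          # 500
```

For amplicon data, both values are expected to sit close to the amplicon size, which is why a mean of around 500 nt is reassuring here.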

How many reads do the samples have?
3K reads

First question

How do I check whether my Nanopore data was correctly sequenced?
Using NanoPlot and looking at the mean read length.

2. Trimming

When Nanopore reads are being sequenced, the MinKNOW software splits Fast5 reads into quality pass and quality fail. As we will select only the Fast5 pass reads, we won't need to perform quality trimming: even if we see that the reads have a bad Phred score, we know that the ONT software considered them "good quality".
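For intuition about what a per-read quality score means, a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below averages per-base error probabilities and converts the result back to the Phred scale. Whether the ONT software computes its pass/fail score in exactly this way is an assumption here, and no pass threshold is implied.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def mean_read_quality(phred_values):
    """Average the error probabilities of a read and express the result as a Phred score."""
    mean_p = sum(phred_to_error_prob(q) for q in phred_values) / len(phred_values)
    return -10 * math.log10(mean_p)


# Toy per-base scores, invented for illustration.
print(phred_to_error_prob(20))      # a Phred score of 20 means 1 error in 100
print(mean_read_quality([10, 30]))  # lower than the arithmetic mean of 20
```

Averaging on the probability scale lets a few poor bases pull the read score down more than a plain average of Phred values would, which is one reason Nanopore scores can look low compared with Illumina data.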

Then we will only be performing a read length trimming. As we are using amplicon sequencing data, we do not expect reads shorter than 400 nucleotides or longer than 600; longer reads most likely correspond to chimeric reads.

  1. Search for artic tool
  2. Select ARTIC guppyplex Filter Nanopore reads by read length and (optionally) quality
  3. While pressing the Ctrl key, select the three samples
  4. Remove reads longer than = 600
  5. Remove reads shorter than = 400
  6. Do not filter on quality score (speeds up processing) = Yes (we have already selected the pass reads)
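The length filter configured in the steps above can be sketched in a few lines. This is an illustration of the idea, not the ARTIC guppyplex implementation. Treating both bounds as inclusive is an assumption, and the toy reads are made up.

```python
def filter_by_length(records, min_len=400, max_len=600):
    """Keep (name, seq) records whose sequence length lies within [min_len, max_len]."""
    return [rec for rec in records if min_len <= len(rec[1]) <= max_len]


# Toy reads of different lengths, invented for illustration.
toy_reads = [
    ("short", "A" * 350),
    ("lower_bound", "A" * 400),
    ("typical", "A" * 500),
    ("upper_bound", "A" * 600),
    ("chimera", "A" * 750),
]
kept = filter_by_length(toy_reads)
print([name for name, _ in kept])  # ['lower_bound', 'typical', 'upper_bound']
```

Reads outside the 400 to 600 nucleotide window are dropped: fragments that are too short to be a full amplicon, and long reads that are likely chimeras.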

nanofilt_run

We will come across one error in this job:

artic_error

This happens because the software needed to filter SARS-CoV-2 amplicon data is not properly installed on this Galaxy server, which is a kind of problem you will commonly come across in Galaxy.